11% of pregnant US women reported past month cigarette use in 2018.
Prenatal maternal smoking is associated with negative health outcomes, including low birth weight, asthama and neurodevelopmental and behavioral issues in exposed children.
Prenatal maternal smoking is hard to measure: social desirability bias, serum cotinine is short-lived in pregnant women, and in adult cohorts, possibility of recall bias.
Finding a reliable and specific biomarker of prenatal maternal smoking is desirable.
Previous work has found many associations between prenatal maternal smoking and DNA methylation, a type of epigenetic change to DNA where a methyl attaches to a CpG site.
Conduct an ancestry-stratified epigenome wide association study of prenatal maternal smoking and DNA methylation from child saliva samples at age nine and age fifteen.
Examine differences in methylation trajectories from age nine to fifteen of the most significant methylation sites identified in the epigenome-wide association study between those exposed vs unexposed to prenatal smoking.
Develop a saliva epigenetic classifier of prenatal smoke exposure using a machine learning approach and support vector machines and evaluate the agreement of the classifier at two time points.
Hypotheses: - I hypothesize that different CpG loci will be associated with maternal prenatal smoking in African ancestry groups than in European and Hispanic ancestry groups.
I hypothesize that the direction of associations between CpGs and prenatal maternal smoking will stay consistent between ages nine and fifteen but that the magnitude of associations will decline over time.
Finally, I hypothesize that classification of prenatal maternal smoking status based upon DNA methylation information from salivary samples will be accurate and consistent at ages nine and fifteen.
pdqc_all<-readRDS(paste0(datadir, 'OGData/', "pd_qc.rds"))
pdqc<-pdqc_all %>% filter(probe_fail_10==0 & sex==predicted_sex)
#What FF ids have methylation data?
methy_ids<-pdqc$id
#create slide variable fo pdqc
pdqc$slide <- gsub('_.*', '', pdqc$MethID)
pdqc$array<-gsub('.*_', '', pdqc$MethID)
#ids with 9visit, 15visit or both visit
visit9ids<-pdqc%>%filter(childteen=='C')%>%pull(idnum)%>%unique()
visit15ids<-pdqc%>%filter(childteen=='T')%>%pull(idnum)%>%unique()
bothvisit_ids<-visit9ids[visit9ids %in% visit15ids]
visit9only_ids<-visit9ids[!(visit9ids %in%bothvisit_ids)]
visit15only_ids<-visit15ids[!(visit15ids %in%bothvisit_ids)]#create flag in FF for having methylation data
#and flag for having 2 visits of methylation data
FF_labeled<-FF_labeled %>%
mutate(MethylData=ifelse(id %in% methy_ids, "Analysis subset", "Not in analysis subset"),
TwoVisitData=ifelse(id%in%bothvisit_ids,"Analysis subset - both visits", "Not in both visit analysis subset"))
#table(FF_labeled$MethylData)
#table(FF_labeled$TwoVisitData)if((length(visit15only_ids)+length(visit9only_ids)+length(bothvisit_ids)!=length(FF_labeled %>% filter(MethylData=="Analysis subset")%>%pull(id)))){print("WARNING ID LENGTHS DO NOT ADD UP")}
if(length(bothvisit_ids)!=length(FF_labeled %>% filter(TwoVisitData=="Analysis subset - both visits")%>%pull(id))){print("WARNING ID LENGTHS DO NOT ADD UP")}I filtered to only 1 technical replicate per person-visit, picking the replicate with the lower probe fail percentage
#technical replicates will be duplicate id + childteen variables from pdqc
pdqc<-pdqc %>% mutate(newID=paste(idnum, childteen, sep='_'))
techrep_ids<- pdqc %>% select(newID) %>% group_by(newID)%>%filter(n()>1)
techreps<-pdqc %>% filter(newID %in% techrep_ids$newID)%>%arrange(idnum,childteen)
hiPctPF<-techreps %>% group_by(newID) %>%filter(probe_fail_pct==max(probe_fail_pct))%>%pull(MethID)
pdqc_clean<-pdqc %>% filter(!(MethID %in% hiPctPF) & childteen!='M')
myMethIDs<-pdqc_clean %>% pull(MethID)batchData<-read.csv(file.path(datadir, 'Methylation_450K_array_batch_information.csv'))FF_labeled<-FF_labeled %>%mutate(smkPreg_binary=ifelse(m1g4%in% c("1 2+pk/d","2 1<pk<2","3 <1pk/d" ), "Yes", ifelse(m1g4%in%c("-2 Don't know", "-3 Missing", "-9 Not in wave"), NA, ifelse(m1g4== "4 None", "No", ifelse(m1g4=="Missing", "Missing", "FLAG")))))
createTable(compareGroups(m1g4~smkPreg_binary, data=FF_labeled))##
## --------Summary descriptives table by 'During the preg, how many cigarettes did you smoke?'---------
##
## _______________________________________________________________________________
## 1 2+pk/d 2 1<pk<2 3 <1pk/d 4 None Missing p.overall
## N=15 N=103 N=834 N=3934 N=12
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## smkPreg_binary: .
## Missing 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 12 (100%)
## No 0 (0.00%) 0 (0.00%) 0 (0.00%) 3934 (100%) 0 (0.00%)
## Yes 15 (100%) 103 (100%) 834 (100%) 0 (0.00%) 0 (0.00%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Ancestry is based on csv files of ancestry-specific PCS created by Dr. Erin Ware (per Jonah meeting Oct 6 2020). If an ID was in {african/european/hispanic}, then the individual is labeled as such.
# read in PCs
euroPC<-read.csv(paste0(datadir, "OGData/pcs/european.csv"))
afrPC<-read.csv(paste0(datadir, "OGData/pcs/african.csv"))
hisPC<-read.csv(paste0(datadir, "OGData/pcs/hispanic.csv"))
allPC<-read.table(paste0(datadir, "OGData/pcs/global_pcs.txt"), col.names=c("X", "idnum", paste("PC", 1:20, sep='')))
#individuals in each ancestry specific pc group can be labeled as that ancestry for categorical ancestry variable
FF_labeled<-FF_labeled %>% mutate(ancestry=case_when(MethylData=="Analysis subset" & id %in% hisPC$idnum~"Hispanic ancestry", MethylData=="Analysis subset" & id %in%afrPC$idnum~"African ancestry", MethylData=="Analysis subset" & id %in% euroPC$idnum~"European ancestry", MethylData=="Analysis subset" & !(id %in%c(hisPC$idnum, afrPC$idnum, euroPC$idnum)) & (id %in% allPC$idnum)~"Other ancestry", MethylData=="Analysis subset" & !(id %in%c(hisPC$idnum, afrPC$idnum, euroPC$idnum)) & !(id %in% allPC$idnum)~"Missing PC data", MethylData!="Analysis subset"~"Not in analysis subset"))
#check ancestry variable
#table(FF_labeled$ancestry)
#with(FF_labeled, xtabs(~ancestry+cm1ethrace))
MissingPCdata<-FF_labeled %>% filter(ancestry=="Missing PC data")%>%select(id)
save(MissingPCdata, file = paste0(datadir, "CreatedData/MissingPC.Rdata"))
#add PC data to FF_labeled
FF_labeled<-left_join(FF_labeled, allPC %>% mutate(id=as.character(idnum)) %>% select(id, PC1, PC2)%>%rename_at(vars(-id), function(x) paste0('global_', x, sep='')))## Joining, by = "id"
localPC<-rbind(euroPC, afrPC, hisPC)
FF_labeled<-left_join(FF_labeled, localPC %>% mutate(id=as.character(idnum)) %>% select(id, PC1, PC2)%>% rename_at(vars(-id), function(x) paste0('local_', x, sep='')))## Joining, by = "id"
There were 47 individuals without PC data. None of these individuals had data in the global PC file, so I assume they are actually missing genetic data (for whatever reason, possibly QC).
Recoded frequency of prenatal alcohol drinking and drug use into Yes/no binary variables for drinking and drug use during pregnancy.
#table(FF_labeled$m1g2) # during preg, alchohol freq
#table(FF_labeled$m1g3) # during preg, drugs freq
#with(FF_labeled, table(m1g2, m1g3))
FF_labeled<-FF_labeled %>% mutate_at(c("m1g2", "m1g3"), .funs=funs(YesNoPreg=ifelse(.=="5 Never", "No", ifelse(. =="Missing", "Missing", "Yes"))))
with(FF_labeled, table(m1g2, m1g2_YesNoPreg))## m1g2_YesNoPreg
## m1g2 Missing No Yes
## 1 E day 0 0 8
## 2 Svl/wk 0 0 27
## 3 Svl/mn 0 0 83
## 4 <1X/mn 0 0 403
## 5 Never 0 4364 0
## Missing 13 0 0
with(FF_labeled, table(m1g3, m1g3_YesNoPreg))## m1g3_YesNoPreg
## m1g3 Missing No Yes
## 1 E day 0 0 27
## 2 Svl/wk 0 0 32
## 3 Svl/mn 0 0 61
## 4 <1X/mn 0 0 151
## 5 Never 0 4616 0
## Missing 11 0 0
#table(FF_labeled$m1g5) # in last year, alc/drugs interfered with work/relationships
#table(FF_labeled$m1g6) # ever sought help for drug/alc problemsMaternal postnatal smoking at different survey points is fairly correlated, and alluvial plots of changes over time show broad consistency, so I decided to create a single smoking variable for maternal smoking at age 1 and age 5 - if mothers were smoking at either age 1 or age 5 then “Maternal smoking at age 1 or 5”, if not smoking at age 1 and age 5 then “No maternal smoking at age 1 and age 5” and then if missing data at one age and no at the other, then “Missing” (and if missing at both ages, “Missing”). Also created a similar dose variable capturing maternal packs per day that could be used instead. Decided to use just age 1 and age 5 because at these two time points the mother was asked (as opposed to PCG, also the dataset of Fragile Families metadata is missing PCG questions from age 3 about smoking) and they both preceded the actual saliva collection for both saliva collection age points. Then use a smoking dose variable from the mother or PCG interview at wave of saliva collection as an additional second hand smoke exposure variable.
#Baseline (1) -> Year 1/Age 1 (2) ->Year 3/Age3 (3) -> Year 5 (4) ->Year 9 (5) -> year 15 (6)
#Do you smoke in past month (how many packs per day)
#Mom ->m2j5 (m2j5a) -> -> m4j18 (m4j19) /p4 -> m5g17 (m5g18) / n5f17 (n5f18)
#PCG p3a23a(p3a23b) n5f17 (n5f18)-> p6h76 (p6h77)
#seem to be missing p3 variables (PCG survey year 3)
#number of people in household who smoke
#p3a23->p5h15b->p6h78
#hours child around smoke
#p3a21 -> p4a24->p5h15c
#with(FF_labeled, table(m2j5, m4j18))
#with(FF_labeled, table(m4j18, m5g17))
#with(FF_labeled, table(n5f17, m5g17))
#with(FF_labeled, table(p6h76, m5g17))
cor_plot_postnatalsmk<-plot_grid(FF_labeled %>% select(m2j5, m4j18, m5g17, n5f17, p6h74) %>% mutate_all(funs(as.numeric(.))) %>%ggcorr(label=T)+ggtitle("Correlation between Yes/No smoking (mother) \n at years 1, 5, & 9 & PCG at years 9 & 15")
,FF_labeled %>% select(m2j5a, m4j19, m5g18, n5f18, p6h77) %>% mutate_all(funs(as.numeric(.))) %>%ggcorr(label=T)+ggtitle("Correlation between packs per day smoking (mother) \n at years 1, 5, 9 & PCG at years 9 and 15"), nrow=2)
yesnosmkalluv<-to_lodes_form(as.data.frame(FF_labeled %>% select(m2j5, m4j18, m5g17, p6h74) %>% mutate(m5g17=stringr::str_to_title(m5g17))%>% group_by(m2j5, m4j18, m5g17, p6h74) %>% count()), axes=1:4, id='Cohort')
#ggplot(yesnosmkalluv, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow(stat="alluvium", lode.guidance="backfront", color='darkgrey')+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9"))
smkdose<-to_lodes_form(as.data.frame(FF_labeled %>% select(m2j5a, m4j19, m5g18, p6h77)%>% mutate_at(vars(c("m2j5a", "m4j19", "m5g18")), funs(factor(substring(., 1, 1), labels=levels(FF_labeled$m2j5a)))) %>% group_by(m2j5a, m4j19, m5g18, p6h77) %>% count()), axes=1:4, id='Cohort')
#ggplot(smkdose, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9"))
alluv_plot_postnatalsmk<-plot_grid(ggplot(yesnosmkalluv, aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9", "Age 15"))
+ggtitle("Alluvial plot of Yes/No smoking variables from age 1 to age 9 (mother) and age 15 (PCG)"), ggplot(smkdose %>% filter(!(stratum %in% c("-6 Skip", "Missing"))), aes(x=x, stratum = stratum, alluvium=Cohort, fill=stratum, label=stratum, y=n))+geom_flow()+geom_stratum()+scale_x_discrete(labels=c("Age 1", "Age 5", "Age 9", "Age 15"))+ggtitle("Alluvial plot of smoking per day among those smoking from age 1 to age 9 (mother) and 15 (PCG)"), nrow=2)## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
#ggplot(FF_labeled, aes(x=p5h15b, y=p6h78))+geom_jitter()+stat_cor()+scale_x_continuous('# smoking in house age 9')+scale_y_continuous('# smoking in house age 15')+ggtitle("# individuals smoking in house correlation age 9 and age 15")cor_plot_postnatalsmkalluv_plot_postnatalsmk#Make any maternal postnatal smoking variable & dose categorization for maternal smoking at ages 1 & 5
FF_labeled<-FF_labeled %>% mutate(PostnatalMaternalSmokingAny = case_when(
m2j5 =="1 Yes" ~ "Maternal smoking at age 1 or age 5",
m4j18=="1 Yes" ~ "Maternal smoking at age 1 or age 5",
m2j5 == "Missing" & m4j18=="Missing" ~ "Missing",
m2j5 == "Missing" & m4j18=="2 No" ~ "Missing",
m2j5 == "2 No" & m4j18=="Missing" ~ "Missing",
m2j5 == "2 No" & m4j18=="2 No" ~ "No maternal smoking at age 1 and age 5")) %>%
mutate(PostnatalMaternalSmokingDose=case_when(
m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d") &
m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5
More 2 packs/day") ~ "Mom - pack/d or greater at both age 1 & 5",
m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d") &
!(m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5
More 2 packs/day")) ~ "Mom - pack/d or greater at either age 1 or 5",
!(m2j5a %in% c("2 1pk/d", "3 1.5pk/d", "4 2pk/d", "5 >2pk/d")) &
m4j19 %in% c("2 About pack/day", "3 Pack and half/day", "4 2 packs/day", "5
More 2 packs/day") ~ "Mom - pack/d or greater at either age 1 or 5",
m2j5a %in% c("1 <1/2 pk/d") | m4j19 %in% c("1 Less half pack/day")~"Mom - less than half pack/d at either age 1 or 5",
m2j5a %in% c("-6 Skip") & m4j19 %in% c("-6 Skip")~
"No maternal smoking at age 1 or 5",
m2j5a %in% c("Missing") & m4j19 %in% c("-6 Skip", "Missing")~"Missing",
m2j5a %in% c("Missing", "-6 Skip") & m4j19 %in% c("Missing")~"Missing"))
with(FF_labeled, table(PostnatalMaternalSmokingAny, m2j5, m4j18))## , , m4j18 = 1 Yes
##
## m2j5
## PostnatalMaternalSmokingAny 1 Yes 2 No Missing
## Maternal smoking at age 1 or age 5 868 305 81
## Missing 0 0 0
## No maternal smoking at age 1 and age 5 0 0 0
##
## , , m4j18 = 2 No
##
## m2j5
## PostnatalMaternalSmokingAny 1 Yes 2 No Missing
## Maternal smoking at age 1 or age 5 168 0 0
## Missing 0 0 183
## No maternal smoking at age 1 and age 5 0 2526 0
##
## , , m4j18 = Missing
##
## m2j5
## PostnatalMaternalSmokingAny 1 Yes 2 No Missing
## Maternal smoking at age 1 or age 5 123 0 0
## Missing 0 371 273
## No maternal smoking at age 1 and age 5 0 0 0
#with(FF_labeled, table(m2j5a, m4j19, PostnatalMaternalSmokingDose))
#with(FF_labeled, table(PostnatalMaternalSmokingAny, PostnatalMaternalSmokingDose))Cell type proportions currently estimated using epidDISH package and the EpiFibIC reference (generic for epithelial tissues). This will be replaced with the saliva specific reference panel in Middleton et al..
betaqc<-readRDS(file=paste0(datadir, 'OGData/', "noob_filtered.rds"))
#cell type proportions
celltypes<-epidish(beta.m = betaqc, ref.m = centEpiFibIC.m, method = "RPC")
estF_FF<-as.data.frame(celltypes$estF)
estF_FF$MethID = row.names(estF_FF)In addition to the site-specific analysis proposed, Kelly and I decided to investigate certain DNA methylation summary metrics that are commonly used in the field. These include global methylation and methylation clocks (which are outcome-naive), and an ooutcome-predicated measure, polymethylation scores.
We calculated global methylation as the mean of the CpG site-specific methylation values.
aprioriCG<-c("cg05575921", "cg04180046", "cg05549655", "cg14179389")
globalmethy<-colMeans(betaqc, na.rm=T)
globalmethdf<-as.data.frame(cbind("MethID"=names(globalmethy), "globalmethylation"=globalmethy))
aprioriCGdf<-as.data.frame(t(betaqc[aprioriCG, ]))
aprioriCGdf$MethID = colnames(betaqc)Methylation or epigenetic clocks are measures which exploit certain sets of CpG sites whose DNA methylation levels associate with subject age. These clocks are highly accurate molecular correlates of chronological age. However, variation in epigenetic clock measures among individuals of the same chronological age do exist and are assumed to reflect biological age.
Four different methylation clocks were created, based on different papers that used different tissue types and age groups for their training samples: Horvath13, Horvath18, Pediatric and Levine. Since this is a pediatric population, the pediatric clock is probably the best bet, but Jonah also suggested we look at the Levine clock.
clocks<-read.csv(file=paste0(datadir, "OGData/ffcw_n1776_8clocks.csv"), header = T)
colnames(clocks)[1]<-"MethID"Here, the idea is similar to a polygenetic score. Essentially, we want to create a summary metric of CpG sites that have associated with an exposure or outcome of interest in a previous independent cohort. We identify a previous study which conducted a regression analysis of our exposure of interest on CpG sites and extract those regression coefficients. The regression coefficients become weights for the CpG site values in our cohort. We sum the weighted CpG site values across sites for each sample to come up with a single per-sample measure of exposure-related methylation.
\(Score_i=\sum^{m}_j Weight_j*CpG_{ij}\) where
\(i\) indexes samples in our study; \(j\) indexes CpG sites
\(Weight_j\) is the regression coefficient from the independent external study for that \(j\) CpG site
\(CpG_{ij}\) is the CpG methylation value for that \(j\) CpG in that \(i\) sample in our study
\(Score_i\) is the polymethylation score.
Because polymethylation scores are not widely used (yet), there isn’t a standardized approach. Importantly, much of the time researchers use the (confusingly named) ‘beta’ value to measure methylation at a CpG site, which is the ratio of the methylated probe intensity to the overall intensity and ranges from 0-1. We were concerned that this could lead to sites with more methylation (closer to 1) being upweighted in the resulting score), so we implemented versions of the score where the \(CpG_{ij}\) values were first mean-centered or z-score standardized before being used in the score.
For this analysis, we used a 2016 genome-wide consortium meta-analysisof prenatal maternal smoking and DNA methylation in blood samples, which used the same DNA methylation array. This study actually provided regression coefficients for all CpGs statistically significant after FDR correction (\(m_{CpG}=6073\)) in 4 different regressions:
These regression used normalized betas as the outcomes but found very little variation in results when using raw betas as the outcome (Spearman \(\rho\)=0.96 for regression coefficients, \(\rho\)=0.98 for \(log_{10}(pvalues\)))
#read in cpgs from PMC4833289
cpgs<-read.csv(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), stringsAsFactors = F, header=F, skip=2)
#make headers
header1<-scan(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), nlines=1, what=character(), sep=',')
header1[7:26]<-c(rep("SSnewborn", 5), rep("SSnewbornCT", 5), rep("SSolder", 5), rep("anynewborn", 5))
header2<-paste(header1, scan(paste0(datadir, 'PMC4833289_SuppFile2CpGs.csv'), nlines=1, skip=1, what=character(), sep=','), sep='_')
names(cpgs)<-header2
#four meta-EWASes: SS_newborn (sustained smoking in newborns); SS_newbornCT (sustained smoking in newborns controlling for cell type) SS_older (sustained smoking in older children); any_newborn (any smoking in newborns)
#pivot_longer to get single column of coefs and a column for the analysis type
#and group into a list by analysis type
cpgs<-cpgs %>% pivot_longer(cols = c(matches("_Coef|_SE|_P|_FDR|_Direction")), names_to = c("set", ".value"), names_pattern="(.+)_(.+)")%>%rename_with(~gsub('_', '', .x))%>%group_split(set)
cpgs<-cpgs%>%setNames(lapply(cpgs, function(x){x$set %>% unique()}))
#make scores - no transform, zscore and centermean
#first, pull script
source(file.path(codedir, 'UsefulCode', 'makePolyEpiScores.R'))
scores=makePolyEpiScore(m=betaqc, b=cpgs)## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
## [1] "transforming the methylation matrix:none"
scores_z=makePolyEpiScore(m=betaqc, b=cpgs, transformopt = 'zscore')## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
## [1] "transforming the methylation matrix:zscore"
scores_center=makePolyEpiScore(m=betaqc, b=cpgs, transformopt = 'meancenter')## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
## [1] "transforming the methylation matrix:meancenter"
polyscores=list('notransform'=scores, 'zscore'=scores_z, 'center'=scores_center)
polyscores<-lapply(polyscores, function(x) x%>% mutate(MethID=colnames(betaqc)))
save(polyscores, file = file.path(datadir, 'CreatedData', 'polymethylationscores.RDS'))
polyscores_wide<-lapply(seq_along(polyscores), function(i) polyscores[[i]] %>% setNames(c(paste0(colnames(polyscores[[i]])[1:4], '_', names(polyscores)[i]), 'MethID')))%>%reduce(left_join, by='MethID')clocks2<-clocks %>% mutate(idnum=gsub('a', '', idnum))%>%filter(MethID %in% pdqc$MethID)%>%select(MethID, idnum, childteen, horvath, pediatric, levine, grim) # currently clocks are missing 3 individuals who were in johns pipeline but not jonahs
methyldata<-left_join(left_join(left_join(left_join(left_join(left_join(pdqc_clean %>% select(MethID, slide, array,childteen, idnum), globalmethdf), clocks2), estF_FF), aprioriCGdf), polyscores_wide), batchData)## Joining, by = "MethID"
## Joining, by = c("MethID", "childteen", "idnum")
## Joining, by = "MethID"
## Joining, by = "MethID"
## Joining, by = "MethID"
## Joining, by = "MethID"
methyldata[, c("globalmethylation", aprioriCG)]<-lapply(methyldata[, c("globalmethylation", aprioriCG)],function(x) as.numeric(as.character(x)))
#lapply(methyldata, class)
methyldata_wide <- methyldata %>% select(-MethID)%>% pivot_wider(., names_from=childteen, values_from=any_of(colnames(methyldata)[4:length(colnames(methyldata))]))#weirdID<-methyldata %>% filter(is.na(globalmethylation))%>%pull(MethID)
#globalmethy[weirdID] #not actually missing global methylation data
#clocks %>% filter(MethID%in%weirdID) # #idnum from not the same as in
#pdqc_clean %>% filter(MethID==weirdID) #this is just a typo in the idnum from data entry, see format of other #idnums
#pdqc_clean %>% pull(idnum) %>% head()
#solution: when creating methyldata, first trim alphabet from clocks$idnum that contain alphabetic characters --> implemented
#also misssing clocks for 3 sex outlier samples that weren't in jonahs' pipeline:
table(is.na(methyldata$horvath13))
weirdIDs<-methyldata %>% filter(is.na(horvath13))%>%pull(MethID)
#pdqc_all %>% filter(MethID %in% weirdID) %>% xtabs(~predicted_sex_outlier, data=.)varLabels2<-c(paste(colnames(FF_labeled)[1:66], varLabels), "Has any methylation data?", "Has two visits of methylation data", "Any maternal smoking during pregnancy", "Ancestry from PCA of child's genetic data", "PC 1 (global)", "PC 2 (global)", "PC 1 (within ancestry category)", "PC 2 (within ancestry category)",
"Any alcohol during pregnancy", "Any drugs during pregnancy", "Any postnatal maternal smoking (age 1 & 5)", "Postnatal maternal smoking dose at age 1 & 5")
FF_labeled<-set_label(FF_labeled, varLabels2)
myFF<-FF_labeled %>% filter(MethylData=="Analysis subset")%>%mutate(idnum=id)
myFF<-left_join(myFF, methyldata)## Joining, by = "idnum"
varLabels3<-c(varLabels2, "ID", "Methylation ID", "Slide", "Array (RC)", "Visit", "Global methylation", "horvath methylation clock", "pediatric methylation clock", "levine methylation clock", "GRIM methylation clock", "Epithelial cell proportion", "Fibroid cell proportion", "Immune cell proportion", "cg05575291 methylation", "cg04180046 methylation", "cg05549655 methylation", "cg14179389 methylation", "PES: Any smoking, newborn cordblood", "PES: Sustained smoking, newborn cordblood", "PES: Sustained smoking, newborn cordblood (+CT control)", "PES: Sustained smoking, older children peripheral blood", "PES: Any smoking, newborn cordblood - zscore", "PES: Sustained smoking, newborn cordblood - zscore", "PES: Sustained smoking, newborn cordblood (+CT control) - zscore", "PES: Sustained smoking, older children peripheral blood - zscore", "PES: Any smoking, newborn cordblood - mean center", "PES: Sustained smoking, newborn cordblood - mean center", "PES: Sustained smoking, newborn cordblood (+CT control) - mean center", "PES: Sustained smoking, older children peripheral blood - mean center", 'Plate', 'Slide', 'Array')
myFF<-set_label(myFF, varLabels3)chart.Correlation(myFF %>%select(globalmethylation, levine, pediatric, grim))chart.Correlation(myFF %>%select(SSnewborn_notransform, SSnewbornCT_notransform, SSolder_notransform, anynewborn_notransform))postscript(file.path(outputdir, 'CorrelationPMSWeights.ps'))
chart.Correlation(myFF %>%select(SSnewborn_notransform, SSnewbornCT_notransform, SSolder_notransform, anynewborn_notransform))
dev.off()## png
## 2
coefs_methyl<-cbind('SSolderCoefs'=cpgs$SSolder$Coef, 'SSnewbornCTCoefs'=cpgs$SSnewbornCT$Coef, 'SSnewbornCoefs'=cpgs$SSnewborn$Coef, 'anynewbornCoefs'=cpgs$anynewborn$Coef)
chart.Correlation(coefs_methyl)#lapply(polyscores, function(x){summary(x)})
#lapply(polyscores, function(x){apply(x, 2, sd)})
dist_plots=lapply(seq_along(polyscores), function(i){polyscores[[i]]%>%pivot_longer(cols=1:4, names_to='Score_Type', values_to='Score')%>%ggplot(aes(x=Score))+geom_histogram()+facet_wrap(~Score_Type)+ggtitle(names(polyscores)[[i]])+theme_classic()})
plot_grid(dist_plots[[1]], dist_plots[[2]], dist_plots[[3]], ncol=1)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
transform_labels<-c('Mean center', 'No transformation', 'Z-score standardization')
names(transform_labels)<-c('center', 'notransform', 'zscore')
postscript(file.path(outputdir, 'DistributionPMStransformation.ps'))
polyscores %>% bind_rows(.id='transform')%>%ggplot(aes(x=SSnewbornCT))+geom_histogram()+facet_wrap(~transform, scales='free', labeller=labeller(transform=transform_labels))+theme_classic()+theme(text=element_text(size=16))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
dev.off()## png
## 2
As Kelly thought, the zscore standardized have a larger range and the biggest standard deviations, the standard deviations of the no transform and centered methods are the same, which makes sense as \(V(aX+b)=a^2V(X)\implies V(\sum\beta_i*[x_i-\bar{x}])=V(\sum\beta_i*x_i-\beta_i*\bar{x})=V(\sum \beta_i*x_i-k)=V(\sum \beta_i*x_i)\).
methylation_data<-myFF %>% select(globalmethylation, pediatric, levine, Epi, Fib, IC, anynewborn_center, SSnewbornCT_center, SSolder_center)
chart.Correlation(methylation_data, histogram=T)#Cell type
cellbytime<-myFF %>% pivot_longer(cols=c(Fib, Epi, IC), names_to="Celltype", values_to = "Proportion")
celltype_labels<-c('Epithelial cells', 'Fibroblasts', 'Immune cells'); names(celltype_labels)<-c('Epi', 'Fib', 'IC')
ggplot(cellbytime, aes(x=childteen, y=Proportion, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+facet_wrap(~Celltype, labeller=labeller(Celltype=celltype_labels))+stat_compare_means(label="..p.signif..")+ggtitle("Cell type proportions by visit")+theme_classic()+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()cairo_ps(file.path(outputdir, 'SpaghettiCellTypes.ps'))
ggplot(cellbytime, aes(x=childteen, y=Proportion, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~Celltype, labeller=labeller(Celltype=celltype_labels))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Proportion')+theme(text=element_text(size=16))## `geom_smooth()` using formula 'y ~ x'
dev.off()## png
## 2
global<-ggplot(myFF, aes(x=childteen, y=globalmethylation, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Global methylation")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
pediatric<-ggplot(myFF, aes(x=childteen, y=pediatric, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Pediatric methylation clock")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
levine<-ggplot(myFF, aes(x=childteen, y=levine, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Levine methylation clock")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
plot_grid(global, pediatric, levine, nrow=1)## Warning: Removed 3 rows containing non-finite values (stat_ydensity).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_compare_means).
## Warning: Removed 3 rows containing non-finite values (stat_ydensity).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_compare_means).
#ggplot(myFF, aes(x=childteen, y=cg05575921, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Methylation at cg05575921 by visit")
any<-ggplot(myFF, aes(x=childteen, y=anynewborn_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Any smoking, \n newborn polymethylation score")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
SSnewborn<-ggplot(myFF, aes(x=childteen, y=SSnewbornCT_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Sustained smoking, \n newborn CT polymethylation score ")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
SSolder<-ggplot(myFF, aes(x=childteen, y=SSolder_center, color=childteen))+geom_violin()+geom_boxplot(width=0.2)+stat_compare_means(label="..p.signif..")+ggtitle("Sustained smoking, \n older children polymethylation score")+scale_color_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+scale_x_discrete(name='Visit', labels=c('Age 9', 'Age 15'))+theme_classic()+theme(legend.position = 'bottom')
plot_grid(any, SSnewborn,SSolder, ncol=3)Cell type proportions do differ between 9 and 15 year visits and although global methylation does not look different between these visits, methylation clocks do (a good check).
Possibly higher polymethylation scores at teen visit, although inconsistent across different score types.
PMS<-methyldata %>%select(idnum, childteen, colnames(polyscores_wide), -MethID)%>%pivot_longer(cols=any_of(colnames(polyscores_wide)), names_to="PMS_type", values_to="PolymethylationScore")%>%pivot_wider(names_from = childteen, values_from=PolymethylationScore)
PMSType_label<-c('Any smoking;\n newborn', 'Sustained smoking;\n newborn', 'Sustained smoking + cell type;\n newborn', 'Sustained smoking;\n older')
names(PMSType_label)<-c('anynewborn_center', 'SSnewborn_center', 'SSnewbornCT_center', 'SSolder_center')
ggplot(PMS%>% filter(str_detect(PMS_type, '_center')), aes(x=C, y=T))+geom_point()+facet_grid(~PMS_type, labeller=labeller(PMS_type=PMSType_label))+stat_cor()+geom_smooth(method=lm)+theme_classic()+scale_x_continuous('Age 9 score')+scale_y_continuous('Age 15 score')## Warning: Removed 300 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).
postscript(file.path(outputdir, 'ScoreCorrelationVisits.ps'))
ggplot(PMS%>% filter(str_detect(PMS_type, '_center')), aes(x=C, y=T))+geom_point()+facet_grid(~PMS_type, labeller=labeller(PMS_type=PMSType_label))+stat_cor()+geom_smooth(method=lm)+theme_classic()+scale_x_continuous('Age 9 score')+scale_y_continuous('Age 15 score')+theme(text=element_text(size=16))## Warning: Removed 300 rows containing non-finite values (stat_cor).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 300 rows containing non-finite values (stat_smooth).
## Warning: Removed 300 rows containing missing values (geom_point).
## Warning in grid.Call.graphics(C_polygon, x$x, x$y, index): semi-transparency is
## not supported on this device: reported only once per page
dev.off()## png
## 2
methyldata %>% pivot_longer(cols=contains('_center'), names_to='measure', values_to='Score')%>%ggplot(aes(x=childteen, y=Score, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~measure, labeller=labeller(measure=PMSType_label))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Score')## `geom_smooth()` using formula 'y ~ x'
cairo_ps(file.path(outputdir, 'SpaghettiScores.ps'))
methyldata %>% pivot_longer(cols=contains('_center'), names_to='measure', values_to='Score')%>%ggplot(aes(x=childteen, y=Score, group=idnum))+geom_point()+geom_line(alpha=0.1, color='grey')+geom_smooth(aes(group=1), method='lm', fill='blue')+facet_wrap(~measure, labeller=labeller(measure=PMSType_label))+theme_classic()+scale_x_discrete('', labels=c('Age 9', 'Age 15'))+ylab('Score')+theme(text=element_text(size=16))## `geom_smooth()` using formula 'y ~ x'
dev.off()## png
## 2
myFF %>% filter(is.na(cp6yagem) &is.na(hv6_yagem) & childteen=='T')%>%dim() #3## [1] 3 113
myFF %>% filter(is.na(cm5b_age) &is.na(hv5_agem) &childteen=='C')%>%dim() #1 ## [1] 0 113
myFF %>% filter(!is.na(cm5b_age) & is.na(hv5_agem) & childteen=='C')%>%dim()#1## [1] 1 113
myFF %>% filter(!is.na(cp6yagem) &is.na(hv6_yagem))%>%dim() #990## [1] 976 113
myFF <- myFF %>%
mutate(ChildAgeAtIn=case_when(childteen=='C'~ cm5b_age, childteen=='T'~cp6yagem))%>%
mutate(ChildAgeAtVisit=case_when(childteen=='C'~hv5_agem, childteen=='T'~hv6_yagem))%>%
mutate(ChildAgeComposite=case_when(childteen=='C' & !is.na(hv5_agem)~hv5_agem,
childteen=='C' & is.na(hv5_agem)~cm5b_age,
childteen=='T' & !is.na(hv6_yagem)~hv6_yagem,
childteen=='T' & is.na(hv6_yagem)~cp6yagem))
myFF %>% filter(childteen=='C')%>%group_by(ChildAgeAtVisit, ChildAgeComposite)%>%summarise(n=n())%>%filter(ChildAgeAtVisit!=ChildAgeComposite)## `summarise()` has grouped output by 'ChildAgeAtVisit'. You can override using the `.groups` argument.
## # A tibble: 0 × 3
## # Groups: ChildAgeAtVisit [0]
## # … with 3 variables: ChildAgeAtVisit <dbl>, ChildAgeComposite <dbl>, n <int>
myFF %>% filter(childteen=='T')%>%group_by(ChildAgeAtVisit, ChildAgeComposite)%>%summarise(n=n())%>%filter(ChildAgeAtVisit!=ChildAgeComposite)## `summarise()` has grouped output by 'ChildAgeAtVisit'. You can override using the `.groups` argument.
## # A tibble: 0 × 3
## # Groups: ChildAgeAtVisit [0]
## # … with 3 variables: ChildAgeAtVisit <dbl>, ChildAgeComposite <dbl>, n <int>
ggplot(myFF%>%filter(childteen=='C'), aes(cm5b_age, hv5_agem))+geom_point()+stat_cor()+ggtitle("Correlation between childs age at mom interview (9 yr) and childs age at home visit (9 yr)")## Warning: Removed 23 rows containing non-finite values (stat_cor).
## Warning: Removed 23 rows containing missing values (geom_point).
age5resid<-myFF %>% filter(childteen=='C')%>%mutate(age5_resid=cm5b_age-hv5_agem)%>%pull(age5_resid)
hist(age5resid)age6resid<-myFF %>% filter(childteen=='T')%>%mutate(age6_resid=cp6yagem-hv6_yagem)%>%pull(age6_resid)
ggplot(myFF%>%filter(childteen=='T'), aes(cp6yagem,hv6_yagem))+geom_point()+stat_cor()+ggtitle("Correlation between childs age at mom interview (15 yr) and childs age at home visit (15 yr)")## Warning: Removed 498 rows containing non-finite values (stat_cor).
## Warning: Removed 498 rows containing missing values (geom_point).
hist(age6resid)table(FF_labeled$cp6yagem-FF_labeled$hv6_yagem)##
## 0 5
## 1088 1
table(FF_labeled$cm5b_age-FF_labeled$hv5_agem)##
## -21 -18 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2
## 1 1 2 4 3 6 15 10 5 10 13 16 13 24 52 70
## -1 0 1 2 3 4 5 6 7 8 10 11 12 13 14 15
## 123 189 1424 1222 18 12 5 15 4 4 3 2 1 1 1 3
## 16 17 21
## 1 1 1
table(is.na(myFF$ChildAgeComposite)) # 4##
## FALSE TRUE
## 1682 3
myFF<-myFF %>% mutate(SmkAtVisitPastmonth=case_when(childteen=='C' & m5g17=='1 yes' & m5g18 %in% c("2 about a pack", "3 a pack and a half", "4 about 2 packs",
"5 more than two packs") ~ "Pack or more a day",
childteen=='C' & m5g17=='1 yes' & m5g18=="1 less than half a pack a day" ~ "Less than pack a day",
childteen=='C' & n5f17=="1 yes" & n5f18 =='2 about a pack' ~ "Pack or more a day",
childteen=='C' & n5f17=="1 yes" & n5f18 =="1 less than half a pack day" ~ "Less than pack a day",
childteen=='C' & m5g17=="1 yes" & m5g18 %in% c("-6 Skip", "Missing")~"Missing",
childteen == 'C' & n5f17%in%c("-7 N/A", "2 no") & m5g17=="2 no" ~ "No smoking",
childteen == 'C' & n5f17%in%c("2 no") & m5g17%in%c("2 no", "Missing") ~ "No smoking",
childteen == 'C' & n5f17 %in% c("Missing", "-7 N/A") & m5g17 =="Missing" ~ "Missing",
childteen=='T' & p6h76 %in% c("-6 Skip", "0 None") ~ "No smoking",
childteen=='T' & p6h77%in% c("1 5 or fewer cigarettes per day", "2 About half a pack a day, 10 cigarettes")~"Less than pack a day",
childteen=='T' & p6h77 %in% c("3 About a pack a day, 20 cigarettes", "4 More than 1 pack a day")~"Pack or more a day",
childteen=='T' & p6h77 == "Missing"~"Missing"))%>%
mutate(SmkAtVisitYesNo=case_when(SmkAtVisitPastmonth=="No smoking"~"No",
SmkAtVisitPastmonth %in% c("Less than pack a day", "Pack or more a day") ~ "Yes",
childteen=='T'& SmkAtVisitPastmonth=="Missing" & !(p6h76 %in% c("-6 Skip", "0 None", "Missing"))~"Yes",
childteen=='C' & SmkAtVisitPastmonth=="Missing" & m5g17=='1 yes'~"Yes",
childteen=='C' & SmkAtVisitPastmonth=="Missing" & m5g17!='1 yes'~"Missing",
childteen=='T'& SmkAtVisitPastmonth=="Missing" & p6h76 %in% c("-6 Skip", "0 None", "Missing")~"Missing"))
myFF %>% filter(childteen=='C') %>% xtabs(~SmkAtVisitPastmonth+SmkAtVisitYesNo, data=.)## SmkAtVisitYesNo
## SmkAtVisitPastmonth Missing No Yes
## Less than pack a day 0 0 193
## Missing 13 0 0
## No smoking 0 544 0
## Pack or more a day 0 0 78
myFF %>% filter(childteen=='T') %>% xtabs(~SmkAtVisitPastmonth+SmkAtVisitYesNo, data=.)## SmkAtVisitYesNo
## SmkAtVisitPastmonth Missing No Yes
## Less than pack a day 0 0 184
## Missing 3 0 2
## No smoking 0 628 0
## Pack or more a day 0 0 39
myFF$SmkAtVisitPastmonth<-factor(myFF$SmkAtVisitPastmonth)Exclude children who reported smoking at age 9 and age 15. Interestingly, the children who report smoking at age 9 do not report smoking at age 15, and the parental perception of child smoking at age 9 does not match the children who say they have ever smoked a cigarette. For now using the ‘ever smoked’ variable from age 15 for the exclusion, but there is also a ‘smoked in past month’ that we could use instead (lose fewer children).
with(myFF, table(k5f1l,childteen)) # 6 kids smoking at age 9 ## childteen
## k5f1l C T
## 1 yes 6 6
## 2 no 816 835
## Missing 7 15
with(myFF, table(p5q3cr, k5f1l)) # not the same kids whose parents say somewhat or very true that the child is using tobacco## k5f1l
## p5q3cr 1 yes 2 no Missing
## 1 not true 12 1608 14
## 2 somewhat or sometimes true 0 4 0
## 3 very true or often true 0 2 0
## Missing 0 37 8
with(myFF, table(k6d40, childteen)) # 41 kids say ever smoked a cigarette at age 15## childteen
## k6d40 C T
## 1 Yes 41 40
## 2 No 779 809
## Missing 9 7
#with(myFF, summary(k6d41))
with(myFF, table(k6d42, childteen)) # of these only 13 say they've smoked in the past month ## childteen
## k6d42 C T
## -6 Skip 779 809
## 1 Never 28 28
## 2 Once or twice a week 10 10
## 3 3 to 5 days a week 0 0
## 4 6 to 7 days a week 3 2
## Missing 9 7
#with(myFF, summary(k6d43))
with(myFF, table(k5f1l, k6d40, childteen))# not the same kids at age 9 and 15## , , childteen = C
##
## k6d40
## k5f1l 1 Yes 2 No Missing
## 1 yes 1 5 0
## 2 no 40 769 7
## Missing 0 5 2
##
## , , childteen = T
##
## k6d40
## k5f1l 1 Yes 2 No Missing
## 1 yes 1 5 0
## 2 no 39 791 5
## Missing 0 13 2
varLabels4<-c(varLabels3, "Child's age at maternal/PCG (9/15 yr) interview (months)", "Child's age at home vist (9/15 years) (months)", "Constructed child's age at home visit (in months) (9/15 years)", "Mother/PCG (9/15 yr) smoking in month prior to interview (pk/day)", "Mother/PCG (9/15 yr) smoking in month prior to interview (yes/no)")
myFF<-set_label(myFF, varLabels4)slide_count<-myFF %>% mutate(slide=as.factor(slide))%>% group_by(smkPreg_binary, slide, .drop=F)%>%count()%>%pivot_wider(id_cols = slide, names_from=smkPreg_binary, values_from=n)%>%arrange(Yes)%>%mutate(Proportion=Yes/(No+Yes))
hist_data<-slide_count %>% ggplot(aes(x=Proportion))+geom_histogram(binwidth = 0.1, fill='blue', alpha=0.5, breaks=seq(0, 1, 0.1))+ggtitle('Histogram proprotion smk (from data)')
hist_data_c<-slide_count %>% ggplot(aes(x=Yes))+geom_histogram( fill='blue', alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))+ggtitle('Histogram # Yes smk (from data)')
kable(slide_count)%>%kable_styling()%>%scroll_box(width='100%', height='400px')| slide | No | Yes | Proportion |
|---|---|---|---|
| 200347970068 | 12 | 0 | 0.0000000 |
| 200394970127 | 12 | 0 | 0.0000000 |
| 200397540068 | 12 | 0 | 0.0000000 |
| 200490750005 | 9 | 0 | 0.0000000 |
| 200490750009 | 11 | 0 | 0.0000000 |
| 200490750011 | 11 | 0 | 0.0000000 |
| 200490750063 | 8 | 0 | 0.0000000 |
| 200490750108 | 12 | 0 | 0.0000000 |
| 200490750124 | 12 | 0 | 0.0000000 |
| 200556930100 | 11 | 0 | 0.0000000 |
| 200556930111 | 10 | 0 | 0.0000000 |
| 200770460118 | 9 | 0 | 0.0000000 |
| 200770460172 | 12 | 0 | 0.0000000 |
| 200770460173 | 10 | 0 | 0.0000000 |
| 200863810009 | 12 | 0 | 0.0000000 |
| 200863810070 | 12 | 0 | 0.0000000 |
| 200863810114 | 10 | 0 | 0.0000000 |
| 200866140081 | 9 | 0 | 0.0000000 |
| 200866140163 | 12 | 0 | 0.0000000 |
| 201502620070 | 11 | 0 | 0.0000000 |
| 201502620071 | 11 | 0 | 0.0000000 |
| 201561870093 | 4 | 0 | 0.0000000 |
| 201561870123 | 12 | 0 | 0.0000000 |
| 3999984076 | 12 | 0 | 0.0000000 |
| 3999984094 | 12 | 0 | 0.0000000 |
| 3999984111 | 12 | 0 | 0.0000000 |
| 3999997019 | 12 | 0 | 0.0000000 |
| 9979858077 | 12 | 0 | 0.0000000 |
| 9979858078 | 12 | 0 | 0.0000000 |
| 9979858102 | 10 | 0 | 0.0000000 |
| 9979858118 | 12 | 0 | 0.0000000 |
| 9979858128 | 12 | 0 | 0.0000000 |
| 9986360014 | 12 | 0 | 0.0000000 |
| 9986360015 | 12 | 0 | 0.0000000 |
| 200397860021 | 10 | 1 | 0.0909091 |
| 200397860035 | 10 | 1 | 0.0909091 |
| 200490750003 | 10 | 1 | 0.0909091 |
| 200863810112 | 7 | 1 | 0.1250000 |
| 201502620017 | 10 | 1 | 0.0909091 |
| 201561870100 | 7 | 1 | 0.1250000 |
| 3999932043 | 9 | 1 | 0.1000000 |
| 9829118090 | 9 | 1 | 0.1000000 |
| 9829118105 | 9 | 1 | 0.1000000 |
| 9986360019 | 11 | 1 | 0.0833333 |
| 200347970051 | 10 | 2 | 0.1666667 |
| 200347970066 | 10 | 2 | 0.1666667 |
| 200347970071 | 10 | 2 | 0.1666667 |
| 200394970128 | 10 | 2 | 0.1666667 |
| 200397540074 | 10 | 2 | 0.1666667 |
| 200397860024 | 10 | 2 | 0.1666667 |
| 200397860046 | 7 | 2 | 0.2222222 |
| 200397860066 | 9 | 2 | 0.1818182 |
| 200397870011 | 8 | 2 | 0.2000000 |
| 200400320071 | 10 | 2 | 0.1666667 |
| 200490750001 | 10 | 2 | 0.1666667 |
| 200490750006 | 7 | 2 | 0.2222222 |
| 200490750021 | 10 | 2 | 0.1666667 |
| 200490750053 | 9 | 2 | 0.1818182 |
| 200490750074 | 8 | 2 | 0.2000000 |
| 200490750079 | 5 | 2 | 0.2857143 |
| 200490750081 | 10 | 2 | 0.1666667 |
| 200490750106 | 7 | 2 | 0.2222222 |
| 200490750107 | 10 | 2 | 0.1666667 |
| 200490750148 | 10 | 2 | 0.1666667 |
| 200556930110 | 10 | 2 | 0.1666667 |
| 200556930121 | 9 | 2 | 0.1818182 |
| 200556930130 | 9 | 2 | 0.1818182 |
| 200770460063 | 7 | 2 | 0.2222222 |
| 200863810019 | 10 | 2 | 0.1666667 |
| 200863810029 | 10 | 2 | 0.1666667 |
| 200863810053 | 9 | 2 | 0.1818182 |
| 200866140164 | 10 | 2 | 0.1666667 |
| 200866140165 | 10 | 2 | 0.1666667 |
| 200866140214 | 10 | 2 | 0.1666667 |
| 201502620002 | 9 | 2 | 0.1818182 |
| 201502620003 | 9 | 2 | 0.1818182 |
| 201502620010 | 7 | 2 | 0.2222222 |
| 201502620024 | 8 | 2 | 0.2000000 |
| 201502620039 | 8 | 2 | 0.2000000 |
| 201502620073 | 10 | 2 | 0.1666667 |
| 201502620074 | 9 | 2 | 0.1818182 |
| 201502620078 | 10 | 2 | 0.1666667 |
| 201561870001 | 10 | 2 | 0.1666667 |
| 201561870002 | 10 | 2 | 0.1666667 |
| 201561870060 | 8 | 2 | 0.2000000 |
| 201561870068 | 8 | 2 | 0.2000000 |
| 201561870083 | 10 | 2 | 0.1666667 |
| 201561870090 | 10 | 2 | 0.1666667 |
| 201561870091 | 10 | 2 | 0.1666667 |
| 201561870092 | 10 | 2 | 0.1666667 |
| 201561870101 | 8 | 2 | 0.2000000 |
| 201561870126 | 6 | 2 | 0.2500000 |
| 201561870136 | 9 | 2 | 0.1818182 |
| 3999932048 | 9 | 2 | 0.1818182 |
| 3999932050 | 10 | 2 | 0.1666667 |
| 3999932051 | 10 | 2 | 0.1666667 |
| 3999984051 | 10 | 2 | 0.1666667 |
| 3999984077 | 10 | 2 | 0.1666667 |
| 3999984080 | 7 | 2 | 0.2222222 |
| 3999984088 | 9 | 2 | 0.1818182 |
| 3999984090 | 10 | 2 | 0.1666667 |
| 3999984108 | 10 | 2 | 0.1666667 |
| 9829118115 | 5 | 2 | 0.2857143 |
| 9829118137 | 9 | 2 | 0.1818182 |
| 9979858076 | 10 | 2 | 0.1666667 |
| 9979858105 | 10 | 2 | 0.1666667 |
| 9986360012 | 10 | 2 | 0.1666667 |
| 9986360033 | 9 | 2 | 0.1818182 |
| 200556930073 | 9 | 3 | 0.2500000 |
| 200556930095 | 9 | 3 | 0.2500000 |
| 200770460062 | 8 | 3 | 0.2727273 |
| 200863730056 | 9 | 3 | 0.2500000 |
| 201502620016 | 7 | 3 | 0.3000000 |
| 3999932006 | 5 | 3 | 0.3750000 |
| 3999932091 | 9 | 3 | 0.2500000 |
| 3999932092 | 6 | 3 | 0.3333333 |
| 200347970040 | 8 | 4 | 0.3333333 |
| 200347970050 | 6 | 4 | 0.4000000 |
| 200347970069 | 8 | 4 | 0.3333333 |
| 200397540092 | 8 | 4 | 0.3333333 |
| 200397860096 | 6 | 4 | 0.4000000 |
| 200490750077 | 8 | 4 | 0.3333333 |
| 200490750082 | 8 | 4 | 0.3333333 |
| 200490750090 | 8 | 4 | 0.3333333 |
| 200490750091 | 3 | 4 | 0.5714286 |
| 200490750093 | 8 | 4 | 0.3333333 |
| 200556930109 | 6 | 4 | 0.4000000 |
| 200863730043 | 8 | 4 | 0.3333333 |
| 200863730057 | 8 | 4 | 0.3333333 |
| 200866140028 | 6 | 4 | 0.4000000 |
| 200866140079 | 8 | 4 | 0.3333333 |
| 200866140238 | 8 | 4 | 0.3333333 |
| 3999984091 | 8 | 4 | 0.3333333 |
| 3999984112 | 8 | 4 | 0.3333333 |
| 3999997140 | 8 | 4 | 0.3333333 |
| 9979858131 | 8 | 4 | 0.3333333 |
| 9986360027 | 8 | 4 | 0.3333333 |
| 200490750095 | 7 | 5 | 0.4166667 |
| 201561870059 | 7 | 5 | 0.4166667 |
| 3999932036 | 6 | 5 | 0.4545455 |
| 200347970036 | 6 | 6 | 0.5000000 |
| 200397540069 | 6 | 6 | 0.5000000 |
| 200397860094 | 6 | 6 | 0.5000000 |
| 200490750080 | 2 | 6 | 0.7500000 |
| 200866140159 | 6 | 6 | 0.5000000 |
| 3999984067 | 6 | 6 | 0.5000000 |
| 9986360025 | 6 | 6 | 0.5000000 |
| 9986360046 | 6 | 6 | 0.5000000 |
| 200397540040 | 4 | 8 | 0.6666667 |
| 201561870003 | 4 | 8 | 0.6666667 |
| 3999984100 | 4 | 8 | 0.6666667 |
| 3999984114 | 4 | 8 | 0.6666667 |
#simulation
set.seed(seed=20210216)
n<-nrow(myFF)
yes<-table(myFF$smkPreg_binary)[2]; no<-table(myFF$smkPreg_binary)[1]
values<-data.frame('smk'=sample(c('No', 'Yes'), n, replace=T, prob=c(no/(no+yes), yes/(no+yes))))
values_slides<-values %>%
group_by((row_number()-1) %/% (n()/152)) %>%
nest()%>%unnest(cols=c(data))
colnames(values_slides)<-c('slide', 'smk')
slide_sim<-values_slides%>%mutate(slide=as.factor(slide)) %>% group_by(smk, slide, .drop=F)%>%count()%>%pivot_wider(id_cols = slide, names_from=smk, values_from=n)%>%arrange(Yes)%>%mutate(Proportion=Yes/(No+Yes))
hist_sim<-slide_sim%>% ggplot(aes(x=Proportion))+geom_histogram(binwidth = 0.1, fill='red', alpha=0.5, breaks=seq(0, 1, 0.1))+ggtitle('Histogram proprotion smk (from simulation)')
hist_sim_c<-slide_sim%>% ggplot(aes(x=Yes))+geom_histogram(fill='red', alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))+ggtitle('Histogram # Yes smk (from simulation)')
hist_overlay<-slide_count %>% ggplot(aes(x=Proportion))+geom_histogram(fill='blue', binwidth = 0.1, alpha=0.5, breaks=seq(0, 1, 0.1))+geom_histogram(aes(x=slide_sim$Proportion), breaks=seq(0, 1, 0.1), fill='red', binwidth = 0.1, alpha=0.5)+xlim(0,1)
hist_overlay_c<-slide_count %>% ggplot(aes(x=Yes))+geom_histogram(fill='blue', alpha=0.5, breaks=c(0:12))+geom_histogram(aes(x=slide_sim$Yes), fill='red', alpha=0.5, breaks=c(0:12))+scale_x_continuous(breaks=c(0:12))
#plot_grid(plot_grid(hist_data, hist_sim, nrow=2), hist_overlay, nrow=1)
plot_grid(plot_grid(hist_data_c, hist_sim_c, nrow=2), hist_overlay_c, nrow=1)This was added 05/05/21 following 04/17/21 email from Jonah with batch variables from Princeton (thanks Jonah!)
Platechisq<-chisq.test(table(myFF$Sample_Plate, myFF$smkPreg_binary))
kable(table(myFF$Sample_Plate, myFF$smkPreg_binary))| No | Yes | |
|---|---|---|
| 1 | 32 | 6 |
| 2 | 73 | 22 |
| 3 | 86 | 8 |
| 4 | 74 | 21 |
| 5 | 69 | 24 |
| 6 | 64 | 21 |
| 7 | 70 | 24 |
| 8 | 72 | 24 |
| 9 | 66 | 20 |
| 10 | 58 | 16 |
| 11 | 63 | 24 |
| 12 | 82 | 12 |
| 13 | 73 | 16 |
| 14 | 71 | 16 |
| 15 | 73 | 18 |
| 16 | 76 | 13 |
| 17 | 69 | 14 |
| 18 | 76 | 18 |
| 19 | 67 | 17 |
| 23 | 30 | 7 |
Prenatal maternal smoking is not significantly associated with plate/batch ($^2=$24.65; P-value=0.17 )
methylCohort<-pdqc_all %>% filter(childteen!='M')
basemodelvar<-c("smkPreg_binary", "ancestry", "cm1bsex", "cm1inpov", "Epi", "IC", "ChildAgeComposite")
basemodeldata<-myFF %>% filter_at(all_of(basemodelvar), all_vars(!(. %in%c("Missing", "Missing PC data"))& !is.na(.)))
child_base<-basemodeldata%>%filter(childteen=='C')
teen_base<-basemodeldata%>%filter(childteen=='T')
prenatalexposurevar<-c(basemodelvar, "m1g2_YesNoPreg", "m1g3_YesNoPreg")
prenatalexdata<-basemodeldata %>%filter_at(all_of(prenatalexposurevar), all_vars(!(. %in% c("Missing"))))
child_prenatal<-prenatalexdata%>%filter(childteen=='C')
teen_prenatal<-prenatalexdata%>%filter(childteen=='T')
secondhandsmokevars<-c(prenatalexposurevar, "PostnatalMaternalSmokingAny", "SmkAtVisitPastmonth")
secondhandsmkdata<-prenatalexdata%>%filter_at(all_of(secondhandsmokevars), all_vars(!(. %in% c("Missing"))))
child_secondhand<-secondhandsmkdata %>% filter(childteen=='C')
teen_secondhand<-secondhandsmkdata%>%filter(childteen=='T')
twovisit_secondhand<-secondhandsmkdata %>% filter(idnum %in% child_secondhand$idnum & idnum %in% teen_secondhand$idnum)
child_smoke<-secondhandsmkdata %>% filter(k5f1l!= "2 no" & childteen=='C')
child_nosmoke<-secondhandsmkdata %>% filter(k5f1l=="2 no" & childteen=='C')
teen_smoke<-secondhandsmkdata %>% filter(k6d40!="2 No" & childteen=='T')
teen_nosmoke<-secondhandsmkdata %>% filter(k6d40=="2 No" & childteen=='T')
twovisit_nosmoke<-secondhandsmkdata %>% filter(idnum %in% child_nosmoke$idnum & idnum %in% teen_nosmoke$idnum)
fathersmkdata<-secondhandsmkdata %>% filter(f1g4!="Missing")
#Make Labels
n_fullFF<-paste0("Fragile Families cohort \n", "n=", FF_labeled %>% nrow())
n_methylQC<-paste0("Cohort with 450K methylation data \n", "n=", methylCohort$idnum%>%unique()%>%length(), "individuals with m=", methylCohort$MethID%>%length(), "samples")
n_QCloss<-paste0("Methylation data QC: 12 low-intensity samples dropped, \n", pdqc_all%>%filter(probe_fail_10==T)%>%nrow()-12, " samples with >10% of probes failing dropped, \n", pdqc_all%>%filter(sex!=predicted_sex & probe_fail_10==F & childteen!='M')%>%nrow(), " discordant sex samples dropped, \n & ", nrow(pdqc)-nrow(pdqc_clean), " duplicate samples dropped")
n_methylFF<-paste0("Cohort with quality controlled 450K data \n","n=", myFF %>% select(id) %>% unique() %>% nrow(), " individuals with m=", myFF%>%nrow(), " samples")
n_basemodel<-paste0("Base models: \n", "n=", basemodeldata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", basemodeldata %>%nrow(), " samples")
n_baseexclusions<-paste0("n=", myFF %>% filter_at(all_of(basemodelvar), any_vars((. %in%c("Missing", "Missing PC data"))))%>% select(id) %>% unique() %>% nrow(), " individuals & m=", myFF %>% filter_at(all_of(basemodelvar), any_vars((. %in%c("Missing", "Missing PC data")))) %>% nrow(), " samples missing principal components of ancestry; \n n=", myFF %>% filter_at(all_of(basemodelvar), any_vars((is.na(.))))%>%select(id)%>%unique()%>%nrow(), " individuals & m=", myFF %>% filter_at(all_of(basemodelvar), any_vars((is.na(.))))%>%nrow(), " samples missing child age at maternal/PCG interview")
n_prenatalexposures<-paste0("Prenatal exposures models: \n","n=", prenatalexdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", prenatalexdata %>%nrow(), " samples")
n_prenatalexclusions<-paste0("n=", basemodeldata %>% filter_at(all_of(prenatalexposurevar), any_vars(. %in% c("Missing"))) %>% select(id) %>% unique() %>% nrow(), " individuals & m=", basemodeldata %>% filter_at(all_of(prenatalexposurevar), any_vars(. %in% c("Missing"))) %>% nrow(), " samples missing \n data on maternal prenatal alcohol and drug use")
n_secondhandSmk<-paste0("Secondhand smoke exposure models: \n","n=", secondhandsmkdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", secondhandsmkdata %>%nrow(), " samples")
n_secondhandexclusions<-paste0("n=", prenatalexdata$id[!(prenatalexdata$id %in% secondhandsmkdata$id)] %>% unique()%>% length(), " individuals & m=", prenatalexdata %>% filter_at(all_of(secondhandsmokevars), any_vars(. %in% c("Missing"))) %>% nrow(), " samples missing data \n on maternal or primary care giver postnatal smoking")
n_Child<-paste0('n=', child_secondhand %>% nrow(), " individuals with a 9 year visit")
n_ChildSmkexcluded<-paste0("Exclude children who smoke: \n",'n=', child_nosmoke %>% nrow(), " individuals with a 9 year visit")
n_ChildSmkexclusions<-paste0('n=', child_smoke%>%nrow(), " individual \n smoking at 9 year visit \n or missing focal child smoking data")
n_Teen<-paste0('n=', teen_secondhand %>% nrow(), " individuals with a 15 year visit")
n_TeenSmkexcluded<-paste0("Exclude children who smoke: \n",'n=', teen_nosmoke %>% nrow(), " individuals with a 15 year visit")
n_TeenSmkexclusions<-paste0('n=', teen_smoke%>%nrow(), " individuals \n smoking at 15 year visit \n or missing focal child smoking data")
n_sensFatherSmoking<-paste0("Sensitivity analysis: compare effects of maternal and paternal prenatal smoking \n", "n=", fathersmkdata %>% select(id) %>% unique() %>% nrow(), " individuals with m=", fathersmkdata %>%nrow(), " samples", "\n with prenatal paternal smoking data")
n_sensFatherSmokingEx<- paste0("n=", secondhandsmkdata$id[!(secondhandsmkdata$id %in% fathersmkdata$id)] %>% unique()%>% length(), " individuals & m=", secondhandsmkdata %>% filter(f1g4 %in% c("Missing")) %>% nrow(), " samples missing data on paternal prenatal smoking")
n_twovisits<-paste0("n=", twovisit_secondhand%>%pull(idnum) %>%unique() %>%length(), " individuals with both a 9 and 15 year visit")
n_twovisitsSmkExcluded<-paste0("Exclude children who smoke: \n", "n=", twovisit_nosmoke%>%pull(idnum) %>%unique() %>%length(), " individuals with both a 9 and 15 year visit")
n_twovisitSmkExclusions<-paste0("n=", (twovisit_secondhand%>%pull(idnum) %>%unique() %>%length())-(twovisit_nosmoke%>%pull(idnum) %>%unique() %>%length()), " individuals smoking at \n either 9 and 15 year visit or \n missing focal child smoking data")graph1<-DiagrammeR::grViz("
digraph graph2 {
compound=true;
graph [layout = dot]
node [shape = rectangle, width=4]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']
e [label = '@@5']
f [label = '@@6']
g [label = '@@7']
h [label = '@@8']
i [label = '@@9', group=g1]
j [label = '@@10']
k [label = '@@16']
l [label = '@@17']
m [label = '@@19']
#fake nodes for connecting to edges
node [shape=point, width=0.01, height=0.01]
a1
b1
c1
d1
# e1
f1 [group=g1]
g1
k1
#exclusion nodes
node [shape = rectangle, width=4]
aa [label = '@@20']
bb [label = '@@11']
cc [label = '@@12']
dd [label = '@@13']
ff [label = '@@14', width=3]
gg [label = '@@15', width=3]
kk [label = '@@18', width=3]
#draw connections
a->m
m->a1 [arrowhead=none]
a1->b
b->b1 [arrowhead=none]
b1->c
c->c1 [arrowhead=none]
c1->d
d->d1 [arrowhead=none]
d1->e
e->f; e->g; e->k
{rank=same; h->e [dir=back, minlen=2, style=dashed]}
f->f1 [arrowhead=none]
g->g1 [arrowhead=none]
k->k1 [arrowhead=none]
f1->i
g1->j
k1->l
{rank=same; a1->aa [minlen=2]}
{rank=same; b1->bb [minlen=2]}
{rank=same; c1->cc [minlen=2]}
{rank=same; d1->dd [minlen=2]}
{rank=same; f1->ff [minlen=2]}
{rank=same; g1->gg [minlen=2]}
{rank=same; k1->kk [minlen=2]}
#invisible constraints to force ordering
ff->i [style=invis]
}
[1]: n_fullFF
[2]: n_methylFF
[3]: n_basemodel
[4]: n_prenatalexposures
[5]: n_secondhandSmk
[6]: n_Child
[7]: n_Teen
[8]: n_sensFatherSmoking
[9]: n_ChildSmkexcluded
[10]: n_TeenSmkexcluded
[11]: n_baseexclusions
[12]: n_prenatalexclusions
[13]: n_secondhandexclusions
[14]: n_ChildSmkexclusions
[15]: n_TeenSmkexclusions
[16]: n_twovisits
[17]: n_twovisitsSmkExcluded
[18]: n_twovisitSmkExclusions
[19]: n_methylQC
[20]: n_QCloss
")graph1analysis_data<-rbind(child_nosmoke, teen_nosmoke)methyl1<-createTable(compareGroups(MethylData~smkPreg_binary+cm1inpov+cm1ethrace+cm1bsex+m1g2_YesNoPreg+m1g3_YesNoPreg+PostnatalMaternalSmokingAny+k5f1l+k6d40+f1g4, data=FF_labeled), show.all=T)
export2md(methyl1, caption="Exclusion table of all Fragile Families vs those with quality controlled methylation data for main predictors and key demographic variables from first wave")| [ALL] | Analysis subset | Not in analysis subset | p.overall | |
|---|---|---|---|---|
| N=4898 | N=880 | N=4018 | ||
| Any maternal smoking during pregnancy: | 0.230 | |||
| Missing | 12 (0.24%) | 0 (0.00%) | 12 (0.30%) | |
| No | 3934 (80.3%) | 702 (79.8%) | 3232 (80.4%) | |
| Yes | 952 (19.4%) | 178 (20.2%) | 774 (19.3%) | |
| cm1inpov Constructed - Poverty ratio - mother’s household income/poverty threshold | 2.22 (2.41) | 2.23 (2.43) | 2.22 (2.41) | 0.981 |
| cm1ethrace Mother race (baseline own report): | . | |||
| 1 white, non-hispanic | 1030 (21.0%) | 174 (19.8%) | 856 (21.3%) | |
| 2 black, non-hispanic | 2326 (47.5%) | 485 (55.1%) | 1841 (45.8%) | |
| 3 hispanic | 1336 (27.3%) | 188 (21.4%) | 1148 (28.6%) | |
| 4 other | 194 (3.96%) | 31 (3.52%) | 163 (4.06%) | |
| Missing | 12 (0.24%) | 2 (0.23%) | 10 (0.25%) | |
| cm1bsex Constructed - Focal baby’s gender: | 0.348 | |||
| 1 Boy | 2568 (52.4%) | 444 (50.5%) | 2124 (52.9%) | |
| 2 Girl | 2329 (47.6%) | 436 (49.5%) | 1893 (47.1%) | |
| Missing | 1 (0.02%) | 0 (0.00%) | 1 (0.02%) | |
| Any alcohol during pregnancy: | 0.635 | |||
| Missing | 13 (0.27%) | 2 (0.23%) | 11 (0.27%) | |
| No | 4364 (89.1%) | 777 (88.3%) | 3587 (89.3%) | |
| Yes | 521 (10.6%) | 101 (11.5%) | 420 (10.5%) | |
| Any drugs during pregnancy: | 0.478 | |||
| Missing | 11 (0.22%) | 3 (0.34%) | 8 (0.20%) | |
| No | 4616 (94.2%) | 833 (94.7%) | 3783 (94.2%) | |
| Yes | 271 (5.53%) | 44 (5.00%) | 227 (5.65%) | |
| Any postnatal maternal smoking (age 1 & 5): | <0.001 | |||
| Maternal smoking at age 1 or age 5 | 1545 (31.5%) | 333 (37.8%) | 1212 (30.2%) | |
| Missing | 827 (16.9%) | 47 (5.34%) | 780 (19.4%) | |
| No maternal smoking at age 1 and age 5 | 2526 (51.6%) | 500 (56.8%) | 2026 (50.4%) | |
| k5f1l F1L. Smoked a cigarette or used tobacco: | <0.001 | |||
| 1 yes | 26 (0.53%) | 6 (0.68%) | 20 (0.50%) | |
| 2 no | 3313 (67.6%) | 859 (97.6%) | 2454 (61.1%) | |
| Missing | 1559 (31.8%) | 15 (1.70%) | 1544 (38.4%) | |
| k6d40 D40. Ever smoked an entire cigarette?: | <0.001 | |||
| 1 Yes | 185 (3.78%) | 41 (4.66%) | 144 (3.58%) | |
| 2 No | 3244 (66.2%) | 829 (94.2%) | 2415 (60.1%) | |
| Missing | 1469 (30.0%) | 10 (1.14%) | 1459 (36.3%) | |
| f1g4 In the past 3 months, how many cigarettes did you smoke a day?: | 0.029 | |||
| 1 2+pk/d | 62 (1.27%) | 11 (1.25%) | 51 (1.27%) | |
| 2 1<pk<2 | 364 (7.43%) | 73 (8.30%) | 291 (7.24%) | |
| 3 <1pk/d | 1067 (21.8%) | 204 (23.2%) | 863 (21.5%) | |
| 4 None | 2327 (47.5%) | 434 (49.3%) | 1893 (47.1%) | |
| Missing | 1078 (22.0%) | 158 (18.0%) | 920 (22.9%) |
myFF$MethylData2<-ifelse(myFF$MethID %in% c(child_nosmoke$MethID, teen_nosmoke$MethID), 'Final analysis group', 'Not in final analysis group')
set_label(myFF$MethylData2)<-'Exclusions based on missing data'
methyl2<-createTable(compareGroups(MethylData2~smkPreg_binary+globalmethylation+pediatric+Epi+IC+anynewborn_center+SSnewbornCT_center+SSolder_center+cm1inpov+ancestry+cm1bsex+m1g2_YesNoPreg+m1g3_YesNoPreg+PostnatalMaternalSmokingAny+SmkAtVisitPastmonth, data=myFF), show.all=T)
export2md(strataTable(methyl2, 'childteen'), caption="Exclusion table of subset with quality controlled methylation data vs final analytic subset for main predictor, outcomes and key demographic variables")| [ALL] | Final analysis group | Not in final analysis group | p.overall | [ALL] | Final analysis group | Not in final analysis group | p.overall | |
|---|---|---|---|---|---|---|---|---|
| N=829 | N=723 | N=106 | N=856 | N=720 | N=136 | |||
| Any maternal smoking during pregnancy: | 0.608 | 0.062 | ||||||
| No | 661 (79.7%) | 574 (79.4%) | 87 (82.1%) | 683 (79.8%) | 583 (81.0%) | 100 (73.5%) | ||
| Yes | 168 (20.3%) | 149 (20.6%) | 19 (17.9%) | 173 (20.2%) | 137 (19.0%) | 36 (26.5%) | ||
| Global methylation | 0.48 (0.01) | 0.48 (0.01) | 0.48 (0.01) | 0.458 | 0.48 (0.01) | 0.48 (0.01) | 0.48 (0.01) | 0.092 |
| pediatric methylation clock | 8.99 (0.92) | 8.99 (0.94) | 8.99 (0.78) | 0.973 | 11.6 (1.73) | 11.6 (1.75) | 11.9 (1.61) | 0.080 |
| Epithelial cell proportion | 0.27 (0.13) | 0.27 (0.13) | 0.29 (0.16) | 0.353 | 0.31 (0.16) | 0.30 (0.16) | 0.33 (0.15) | 0.037 |
| Immune cell proportion | 0.71 (0.13) | 0.71 (0.12) | 0.70 (0.15) | 0.408 | 0.68 (0.15) | 0.69 (0.15) | 0.66 (0.15) | 0.036 |
| PES: Any smoking, newborn cordblood - mean center | -0.02 (0.22) | -0.02 (0.21) | -0.02 (0.23) | 0.921 | 0.02 (0.23) | 0.02 (0.23) | 0.04 (0.23) | 0.316 |
| PES: Sustained smoking, newborn cordblood (+CT control) - mean center | -0.01 (0.23) | -0.01 (0.23) | -0.03 (0.26) | 0.497 | 0.02 (0.23) | 0.02 (0.23) | 0.01 (0.23) | 0.576 |
| PES: Sustained smoking, older children peripheral blood - mean center | 0.00 (0.25) | 0.01 (0.24) | -0.02 (0.30) | 0.297 | 0.00 (0.28) | 0.01 (0.28) | -0.04 (0.28) | 0.051 |
| cm1inpov Constructed - Poverty ratio - mother’s household income/poverty threshold | 2.25 (2.45) | 2.21 (2.41) | 2.49 (2.69) | 0.318 | 2.23 (2.44) | 2.21 (2.43) | 2.32 (2.45) | 0.644 |
| Ancestry from PCA of child’s genetic data: | <0.001 | <0.001 | ||||||
| African ancestry | 484 (58.4%) | 452 (62.5%) | 32 (30.2%) | 496 (57.9%) | 449 (62.4%) | 47 (34.6%) | ||
| European ancestry | 124 (15.0%) | 115 (15.9%) | 9 (8.49%) | 132 (15.4%) | 119 (16.5%) | 13 (9.56%) | ||
| Hispanic ancestry | 180 (21.7%) | 156 (21.6%) | 24 (22.6%) | 182 (21.3%) | 152 (21.1%) | 30 (22.1%) | ||
| Missing PC data | 41 (4.95%) | 0 (0.00%) | 41 (38.7%) | 46 (5.37%) | 0 (0.00%) | 46 (33.8%) | ||
| cm1bsex Constructed - Focal baby’s gender: | 0.119 | 0.003 | ||||||
| 1 Boy | 417 (50.3%) | 356 (49.2%) | 61 (57.5%) | 427 (49.9%) | 343 (47.6%) | 84 (61.8%) | ||
| 2 Girl | 412 (49.7%) | 367 (50.8%) | 45 (42.5%) | 429 (50.1%) | 377 (52.4%) | 52 (38.2%) | ||
| Missing | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | ||
| Any alcohol during pregnancy: | 0.002 | 0.043 | ||||||
| Missing | 2 (0.24%) | 0 (0.00%) | 2 (1.89%) | 2 (0.23%) | 0 (0.00%) | 2 (1.47%) | ||
| No | 731 (88.2%) | 633 (87.6%) | 98 (92.5%) | 757 (88.4%) | 638 (88.6%) | 119 (87.5%) | ||
| Yes | 96 (11.6%) | 90 (12.4%) | 6 (5.66%) | 97 (11.3%) | 82 (11.4%) | 15 (11.0%) | ||
| Any drugs during pregnancy: | 0.003 | 0.007 | ||||||
| Missing | 3 (0.36%) | 0 (0.00%) | 3 (2.83%) | 3 (0.35%) | 0 (0.00%) | 3 (2.21%) | ||
| No | 785 (94.7%) | 687 (95.0%) | 98 (92.5%) | 810 (94.6%) | 683 (94.9%) | 127 (93.4%) | ||
| Yes | 41 (4.95%) | 36 (4.98%) | 5 (4.72%) | 43 (5.02%) | 37 (5.14%) | 6 (4.41%) | ||
| Any postnatal maternal smoking (age 1 & 5): | <0.001 | <0.001 | ||||||
| Maternal smoking at age 1 or age 5 | 315 (38.0%) | 288 (39.8%) | 27 (25.5%) | 323 (37.7%) | 281 (39.0%) | 42 (30.9%) | ||
| Missing | 41 (4.95%) | 0 (0.00%) | 41 (38.7%) | 47 (5.49%) | 0 (0.00%) | 47 (34.6%) | ||
| No maternal smoking at age 1 and age 5 | 473 (57.1%) | 435 (60.2%) | 38 (35.8%) | 486 (56.8%) | 439 (61.0%) | 47 (34.6%) | ||
| Mother/PCG (9/15 yr) smoking in month prior to interview (pk/day): | <0.001 | <0.001 | ||||||
| Less than pack a day | 193 (23.3%) | 173 (23.9%) | 20 (19.0%) | 184 (21.5%) | 157 (21.8%) | 27 (19.9%) | ||
| Missing | 13 (1.57%) | 0 (0.00%) | 13 (12.4%) | 5 (0.58%) | 0 (0.00%) | 5 (3.68%) | ||
| No smoking | 544 (65.7%) | 481 (66.5%) | 63 (60.0%) | 628 (73.4%) | 533 (74.0%) | 95 (69.9%) | ||
| Pack or more a day | 78 (9.42%) | 69 (9.54%) | 9 (8.57%) | 39 (4.56%) | 30 (4.17%) | 9 (6.62%) |
Variables
Base model: Binary smoking variable [X], ancestry categorization, baby sex [X], maternal income to poverty ratio [X], child’s age in month at visit of interest (note, currently using a constructed child’s age at visit, where individuals having a missing child’s age at home visit was supplemented with child’s age at home interview) [X], cell types (Epi + IC, may need to take one out),
Prenatal exposures: Base model variables + prenatal maternal alcohol use (Yes/no) [X], prenatal maternal drug use (yes/no)
Secondhand smoke exposure models: Prenatal exposures model+ any maternal smoking at age 1 & 5 [X], maternal or PCG smoking dose at visit of interest (age 9 or 15) [X]
First hand smoking: Secondhand smoke exposure models + exclude children with missing or reported firsthand smoking at visit of interest (age 9 or 15)
Paternal smoking: Filter to individuals with data on prenatal paternal smoking and compare effect size of prenatal paternal to prenatal maternal (as per Wiklund et al. 2019, idea is that this captures additional confounding)
Ancestry specific models - by ancestry groups; replace ancestry control with control for local PCs within ancestry group
Longitduinal models - no additional filtering needed
save(child_base, file=paste0(datadir, 'CreatedData/', 'ChildBaseModelData.Rdata'))
save(teen_base, file=paste0(datadir, 'CreatedData/', 'ChildBaseModelData.Rdata'))
save(basemodeldata, file=paste0(datadir, 'CreatedData/', 'BaseModelData.Rdata'))
save(child_prenatal, file=paste0(datadir, 'CreatedData/', 'ChildPrenatalExModelData.Rdata'))
save(teen_prenatal, file=paste0(datadir, 'CreatedData/', 'PrenatalExModelData.Rdata'))
save(prenatalexdata, file=paste0(datadir, 'CreatedData/', 'PrenatalExModelData.Rdata'))
save(child_secondhand, file=paste0(datadir, 'CreatedData/', 'ChildSecondHandSmkModelData.Rdata'))
save(teen_secondhand, file=paste0(datadir, 'CreatedData/', 'TeenSecondHandSmkModelData.Rdata'))
save(twovisit_secondhand, file=paste0(datadir, 'CreatedData/', 'TwoVisitSecondHandSmkModelData.Rdata'))
save(secondhandsmkdata, file=paste0(datadir, 'CreatedData/', 'SecondHandSmkModelData.Rdata'))
save(child_nosmoke, file=paste0(datadir, 'CreatedData/', 'ChildNoSmkModelData.Rdata'))
save(teen_nosmoke, file=paste0(datadir, 'CreatedData/', 'TeenNoSmkModelData.Rdata'))
save(twovisit_nosmoke, file=paste0(datadir, 'CreatedData/', 'TwoVisitNoSmkModelData.Rdata'))
save(fathersmkdata, file=paste0(datadir, 'CreatedData/', 'FatherSmkModelData.Rdata'))
save(list=c("basemodeldata", "prenatalexdata", "secondhandsmkdata", "child_secondhand", "teen_secondhand", "twovisit_secondhand", "twovisit_nosmoke", "child_nosmoke", "teen_nosmoke", "fathersmkdata"), file=paste0(datadir, 'CreatedData/', Sys.Date(), 'CreatedData.Rdata'))